Performance Analysis of Different Smoothing Methods on n-grams for Statistical Machine Translation
Abstract
Smoothing techniques adjust the maximum likelihood estimates of probabilities to produce more accurate probabilities, one of the most important tasks when building a language model from a limited amount of training data. The main contribution of this paper is an analysis of the performance of different smoothing techniques on n-grams. We consider three of the most widely used smoothing algorithms for language modeling: Witten-Bell smoothing, Kneser-Ney smoothing, and Modified Kneser-Ney smoothing. For evaluation we use the BLEU (Bilingual Evaluation Understudy) and NIST (National Institute of Standards and Technology) scoring techniques. A detailed evaluation of these models is performed by comparing the automatically produced word alignments. We use the Moses statistical machine translation system for our work (i.e., the Moses decoder, GIZA++, mkcls, SRILM, IRSTLM, Pharaoh, and the BLEU scoring tool). The approach is tested on German-to-English and English-to-German translation tasks. The results we obtain are significantly better than those obtained with alternative approaches to machine translation. This paper addresses several aspects of statistical machine translation (SMT), with emphasis on the architecture and modeling of an SMT system.
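To make the comparison concrete, the sketch below shows the core of interpolated Kneser-Ney smoothing for bigrams in Python. It is an illustration of the technique only, not the SRILM/IRSTLM implementation the paper uses; the discount value 0.75 is a common default rather than a figure from the paper, and all function names are ours.

    from collections import Counter, defaultdict

    def kneser_ney_bigram(corpus, discount=0.75):
        # Interpolated Kneser-Ney for bigrams (illustrative sketch).
        # corpus: iterable of token lists; discount: absolute discount D.
        bigrams = Counter()
        history_counts = Counter()    # c(h): total bigrams starting with h
        followers = defaultdict(set)  # distinct words observed after h
        contexts = defaultdict(set)   # distinct histories observed before w
        for sent in corpus:
            for h, w in zip(sent, sent[1:]):
                bigrams[(h, w)] += 1
                history_counts[h] += 1
                followers[h].add(w)
                contexts[w].add(h)
        bigram_types = len(bigrams)

        def p_continuation(w):
            # Kneser-Ney's key idea: weight a word by how many distinct
            # contexts it follows, not by its raw frequency.
            return len(contexts[w]) / bigram_types

        def p(w, h):
            c_h = history_counts[h]
            if c_h == 0:
                return p_continuation(w)  # unseen history: continuation prob.
            discounted = max(bigrams[(h, w)] - discount, 0.0) / c_h
            reserved = discount * len(followers[h]) / c_h  # mass freed by discounting
            return discounted + reserved * p_continuation(w)

        return p

Modified Kneser-Ney differs mainly in using three separate discounts depending on whether an n-gram was seen once, twice, or more often, while Witten-Bell instead derives the interpolation weight from the number of distinct words following each history.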
Similar resources
N-Gram Language Modeling for Robust Multi-Lingual Document Classification
Statistical n-gram language modeling is used in many domains, such as speech recognition, language identification, machine translation, character recognition, and topic classification. Most language modeling approaches work on n-grams of terms. This paper reports on ongoing research in the MEMPHIS project, which employs models based on character-level n-grams instead of term n-grams. The models ar...
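As a hedged illustration of the character-level features such models are built from (a sketch, not MEMPHIS code):

    def char_ngrams(text, n=3):
        # Character-level n-grams; padding with spaces preserves word-boundary
        # information, which helps robustness across languages.
        padded = " " + text + " "
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    # char_ngrams("machine") -> [' ma', 'mac', 'ach', 'chi', 'hin', 'ine', 'ne ']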
Large Language Models in Machine Translation
This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models with up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff...
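Stupid Backoff is deliberately simple: it returns relative frequencies and backs off with a fixed factor instead of computing normalized, discounted probabilities, which is what makes it cheap at this scale. A minimal Python sketch of that scheme (the factor 0.4 is the commonly cited value; everything else here is our naming):

    from collections import Counter

    def train_stupid_backoff(corpus, max_order=3):
        # Count all n-grams up to max_order.
        counts = Counter()
        total_tokens = 0
        for sent in corpus:
            total_tokens += len(sent)
            for n in range(1, max_order + 1):
                for i in range(len(sent) - n + 1):
                    counts[tuple(sent[i:i + n])] += 1

        def score(ngram, alpha=0.4):
            # Relative frequency if the n-gram was seen; otherwise back off
            # to the shorter history with a fixed penalty alpha. Scores are
            # not normalized, hence "score" rather than "probability".
            ngram = tuple(ngram)
            if len(ngram) == 1:
                return counts[ngram] / total_tokens
            if counts[ngram] > 0:
                return counts[ngram] / counts[ngram[:-1]]
            return alpha * score(ngram[1:], alpha)

        return score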
IRSTLM: an open source toolkit for handling large scale language models
Research in speech recognition and machine translation is boosting the use of large-scale n-gram language models. We present an open source toolkit that makes it possible to efficiently handle language models with billions of n-grams on conventional machines. The IRSTLM toolkit supports distribution of n-gram collection and smoothing over a computer cluster, language model compression through probability ...
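The truncated sentence appears to refer to probability quantization; assuming that reading, the sketch below shows the general idea of compressing a language model by replacing each floating-point log-probability with an index into a small codebook. The bin layout and 8-bit width are illustrative assumptions, not IRSTLM's actual format.

    def quantize_logprobs(logprobs, bits=8):
        # Map each log-probability onto one of 2**bits codebook values, so
        # the model stores 1-byte indices instead of 4- or 8-byte floats.
        lo, hi = min(logprobs), max(logprobs)
        levels = 2 ** bits
        step = (hi - lo) / (levels - 1) or 1.0  # guard against all-equal input
        codebook = [lo + k * step for k in range(levels)]
        indices = [min(round((lp - lo) / step), levels - 1) for lp in logprobs]
        return indices, codebook  # decode a value as codebook[index]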
Improving Sampling-based Alignment by Investigating the Distribution of N-grams in Phrase Translation Tables
This paper describes an approach to improving the performance of sampling-based multilingual alignment on translation tasks by investigating the distribution of n-grams in the translation tables. The approach consists in enforcing the alignment of n-grams. The quality of the phrase translation tables produced by this approach is compared with that of MGIZA++ in statistical machine translation tasks. Si...
Exploiting N-best Hypotheses for SMT Self-Enhancement
Word and n-gram posterior probabilities estimated on N-best hypotheses have been used to improve the performance of statistical machine translation (SMT) in a rescoring framework. In this paper, we extend the idea to estimate the posterior probabilities on N-best hypotheses for translation phrase-pairs, target language n-grams, and source word reorderings. The SMT system is self-enhanced with t...
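A hedged sketch of the underlying computation: the posterior of a target n-gram can be estimated as the normalized sum of the exponentiated model scores of the N-best hypotheses that contain it. The input format and names below are our assumptions, not the paper's code.

    from collections import defaultdict
    from math import exp

    def ngram_posteriors(nbest, n=2):
        # nbest: list of (tokens, log_score) pairs from a decoder's N-best list.
        weights = [exp(score) for _, score in nbest]
        z = sum(weights)
        posteriors = defaultdict(float)
        for (tokens, _), weight in zip(nbest, weights):
            # Count each n-gram once per hypothesis: its posterior is the
            # total normalized weight of the hypotheses containing it.
            seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
            for gram in seen:
                posteriors[gram] += weight / z
        return dict(posteriors)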
Publication date: 2012